Hasegawa, Yuta; Onodera, Naoyuki; Idomura, Yasuhiro
no journal
To reduce memory usage and accelerate data communication in the locally refined lattice Boltzmann code, we implemented intra-node multi-GPU parallelization using Unified Memory in CUDA. In a microbenchmark on a uniform mesh, we achieved 96.4% and 94.6% parallel efficiency for weak and strong scaling of a 3D diffusion problem, and 99.3% and 56.5% parallel efficiency for weak and strong scaling of a D3Q27 lattice Boltzmann problem, respectively. In the locally mesh-refined lattice Boltzmann code, the present method reduced memory usage by 25.5% compared with the Flat MPI implementation. However, this code showed only 9.0% parallel efficiency on strong scaling, which was worse than that of the Flat MPI implementation.
Hasegawa, Yuta; Onodera, Naoyuki; Idomura, Yasuhiro
no journal
In the "CityLBM" project at JAEA, a real-time urban wind prediction code based on AMR (adaptive mesh refinement) was developed. For the next generation of the CityLBM code, ensemble simulations are needed to improve the reliability of the prediction. For this purpose, the memory usage per simulation must be reduced to fit within a single node, i.e., 4-16 GPUs. To reduce memory usage and accelerate data communication in the AMR code, we implemented intra-node multi-GPU parallelization using Unified Memory in CUDA. This approach simplifies parallel GPU implementation, because accesses to Unified Memory are served automatically via HBM2 (on the local GPU) or NVLink (from a neighboring GPU). We implemented multi-GPU calculations of a 3D diffusion equation and a lattice Boltzmann equation on a uniform mesh, and evaluated weak/strong scalability and NVLink performance.
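The intra-node multi-GPU pattern described above can be sketched as follows. This is a minimal illustration for a 1D diffusion stencil, not code from CityLBM or the microbenchmark: the kernel name, grid size, and diffusion coefficient are assumptions. It shows the key idea that a single `cudaMallocManaged` allocation is visible to every GPU in the node, so each device can update its own slice while halo reads at slice boundaries are resolved automatically through page migration or NVLink.

```cuda
// Sketch: multi-GPU 1D diffusion step on a shared Unified Memory array.
// Illustrative only; names and sizes are assumptions, not CityLBM code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void diffuse1D(const double *in, double *out,
                          int n, int off, double c) {
    // Global index: each device covers [off, off + chunk).
    int i = off + blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = in[i] + c * (in[i - 1] - 2.0 * in[i] + in[i + 1]);
}

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    const int n = 1 << 20;
    double *u, *uNew;
    // One managed allocation, accessible from all GPUs in the node.
    cudaMallocManaged(&u,    n * sizeof(double));
    cudaMallocManaged(&uNew, n * sizeof(double));
    for (int i = 0; i < n; ++i) u[i] = (i == n / 2) ? 1.0 : 0.0;

    // Each GPU updates its own slice; stencil reads at slice edges
    // touch the neighbor's pages, served via page migration or NVLink.
    int chunk = n / nDev;
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        diffuse1D<<<(chunk + 255) / 256, 256>>>(u, uNew, n, d * chunk, 0.1);
    }
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }

    printf("uNew[n/2] = %f\n", uNew[n / 2]);
    cudaFree(u);
    cudaFree(uNew);
    return 0;
}
```

Note that this sketch relies entirely on the Unified Memory driver for inter-GPU data movement; no explicit `cudaMemcpyPeer` halo exchange is written, which is what makes the multi-GPU port easy at the cost of leaving communication scheduling to the runtime.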